R version 3.6.3 (2020-02-29) and the R-packages tidyverse [Version 1.3.0], rlang [Version 0.4.6], here [Version 0.1], brms [Version 2.12.0], tidybayes [Version 2.0.1], bayestestR [Version 0.5.2], modelr [Version 0.1.6], ggforce [Version 0.3.1], ggrepel [Version 0.8.2], ggridges [Version 0.5.2], irr [Version 0.84.1], and kableExtra [Version 1.1.0] were used for data preparation, analysis, and presentation.

Coding

Responses from the vocabulary test (30 items) and reading responses from the testing phase (42 items) were transcribed and coded by two coders (GPW and VK) blind to each participant’s condition. The coding convention, which was based on the CPSAMPA (Marian et al., 2012) simplified notation of IPA characters, is described in detail in Williams et al. (2020). For all coded oral responses as well as for all spellings, length-normalised Levenshtein edit distances (nLEDs) to the target string were computed and used as the dependent variable to assess performance. Such edit distances are computed by dividing the minimum number of insertions, substitutions, and deletions required to transform one string (e.g. a participant’s input) into another (e.g. the target word) by the larger of the two string lengths (Levenshtein, 1966). Edit distances constitute a more gradual and fine-grained performance measure than error rates, one that can distinguish near-matches from entirely erroneous productions. When literacy training in the Dialect Literacy condition targeted the dialect variety, dialect variants were adopted as targets for computation of nLEDs.
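
As a rough illustration of the measure described above, the following Python sketch computes a length-normalised Levenshtein edit distance for a pair of strings. The function names (levenshtein, nled) and the example strings are hypothetical and are not taken from the analysis code; the convention of returning 0 when both strings are empty is likewise an assumption.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def nled(response: str, target: str) -> float:
    """Edit distance divided by the longer of the two string lengths."""
    longest = max(len(response), len(target))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as a perfect match
    return levenshtein(response, target) / longest
```

Under this definition an exact match scores 0, a response sharing nothing with the target scores 1, and near-matches fall in between; for instance, nled("kitten", "sitting") is 3/7.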

Inter-coder reliability was computed by obtaining intra-class correlations between the two coders’ nLEDs, using the irr R-package (Gamer et al., 2019). We used a single-score, absolute agreement, two-way random effects model based on the summed nLEDs for each participant. The resulting estimate was ICC = 0.996, 95% CI = [0.985; 0.998], F(319.000, 8.053) = 752.805, \(p\) < .001. The lower bound of the 95% confidence interval around the ICC falls above .90, which suggests excellent reliability across coders (Koo & Li, 2016). Whenever there was a discrepancy between the coders, further analyses were based on the smaller of the two nLEDs. This lenient coding criterion rests on the rationale that a participant response should be regarded as acceptable if at least one of the coders can match it closely to the target.
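
For readers unfamiliar with this reliability index, the sketch below shows one common way to compute a single-score, absolute-agreement, two-way random effects ICC from the mean squares of a participants-by-coders table, followed by the lenient rule of keeping the smaller of two scores. It is a hand-written illustration in Python, not the irr implementation used for the reported analysis, and the function names and example values are hypothetical.

```python
def icc_a1(ratings):
    """Single-score, absolute-agreement, two-way random effects ICC.

    ratings: list of rows, one row per participant, one score per coder.
    """
    n = len(ratings)       # participants
    k = len(ratings[0])    # coders
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def lenient_scores(ratings):
    """Keep the smaller (more lenient) score for each participant."""
    return [min(row) for row in ratings]
```

Because the index penalises systematic differences between coders as well as random disagreement, two coders who rank participants identically but differ by a constant offset would still score below 1.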

Modelling using Bayesian Zero-One Inflated Beta Distributions

The data were analysed using Bayesian distributional models in the brms R-package. Specifically, these models assume the data are drawn from a zero-one inflated Beta distribution. This models the data as a Beta distribution for nLEDs excluding 0 and 1, and a Bernoulli distribution for nLEDs of 0 and 1. Thus, predictors in the model can affect four distributional parameters: \(\mu\) (mu), the mean of the nLEDs excluding 0 and 1; \(\phi\) (phi), the precision (i.e. the inverse of the spread) of the nLEDs excluding 0 and 1; \(\alpha\) (alpha; termed zoi, or zero-one inflation, in brms), the probability of an nLED of 0 or 1; and \(\gamma\) (gamma; termed coi, or conditional-one inflation, in brms), the conditional probability of a 1 given that a 0 or 1 has been observed. Larger values for these parameters are associated with (a) higher mean nLEDs in the range excluding 0 and 1, (b) tighter distributions of the nLEDs in the range excluding 0 and 1 (i.e. less variance), (c) more zero-one inflation in nLEDs, and (d) more one-inflation given zero-one inflation in nLEDs. Predictors in this model can influence any or all distributional parameters at once. For these models, a logit link is used for the \(\mu\), \(\alpha\), and \(\gamma\) distributional parameters, and a log link is used for the \(\phi\) distributional parameter.
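
To make the roles of the four parameters concrete, the following Python sketch draws values from a zero-one inflated Beta distribution and computes the overall mean it implies. It assumes the mean-precision parameterisation of the Beta component (shape parameters \(\mu\phi\) and \((1-\mu)\phi\)); the function names and the parameter values in the usage note are illustrative only and are not estimates from the fitted models.

```python
import math
import random


def inv_logit(x):
    """Inverse of the logit link used for mu, zoi, and coi."""
    return 1.0 / (1.0 + math.exp(-x))


def zoib_mean(mu, zoi, coi):
    """Overall expected value; phi affects spread, not the mean."""
    return zoi * coi + (1.0 - zoi) * mu


def zoib_draw(mu, phi, zoi, coi, rng):
    """One draw from a zero-one inflated Beta distribution."""
    if rng.random() < zoi:                  # boundary value (0 or 1)
        return 1.0 if rng.random() < coi else 0.0
    # Otherwise a Beta draw strictly between 0 and 1 with mean mu
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)
```

For example, with mu = 0.3, zoi = 0.2, and coi = 0.5, the overall mean is 0.2 × 0.5 + 0.8 × 0.3 = 0.34, which is the kind of single nLED-scale summary that combines all distributional parameters.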

These models account for the fact that nLEDs are bounded between 0 and 1, with inflated counts at these bounds on a trial-by-trial basis, and that on an individual trial the observations making up an nLED are autocorrelated. (For example, if the previous letter in a participant’s input requires an insertion, substitution, or deletion, then the next letter is more likely to also require one rather than to remain unchanged.) Crucially, in contrast to general linear models and linear mixed effects models, which assume a Gaussian data generation process, these models do not make predictions outside the possible range of values and accurately capture the larger densities at extreme values. They therefore account for the multitude of ways in which nLEDs can be generated more accurately than a model that assumes only one underlying distribution. For example, with perfect recollection nLEDs are likely to be at or near 0, with varying levels of decoding they are likely to be between 0 and 1, and with guessing they are likely to be close to or at 1.

At the time of writing, distributional models of this nature are only available for hierarchical data using the brms R-package, which requires model fitting to be performed using a Bayesian framework. As an additional benefit, Bayesian models do not suffer from the non-convergence often associated with modelling complex analyses under a Frequentist framework. Given that these models return parameter estimates for the four distributional terms, the values of which depend on one another, drawing inferences from direct inspection of parameter estimates is extremely difficult (if not impossible for such complex models). Bayesian methods allow inferences to be made based on draws from the joint posterior under the conditions of interest. This not only allows for inferences to be made based on simple summaries of the data on the nLED scale, but also allows for uncertainty surrounding all terms to be propagated into these summaries.

Model Fitting

Model Specification and Analysis

Three models were fitted in total: (1) assessing performance across conditions during the vocabulary test prior to literacy training; (2) assessing performance across conditions during the testing phase following literacy training; and (3) assessing performance across conditions during the testing phase following literacy training using the vocabulary test performance as a predictor. This latter model was not pre-registered, but instead serves an exploratory purpose to determine whether or not any effect of dialect exposure is mediated by initial performance. In all models, population-level and group-level effects are estimated for all distributional parameters, with group-level effects correlated across all parameters.

The models were specified as follows:

  • Vocabulary Test Model: nLEDs are predicted by population-level (fixed) effects of Variety Exposure condition (with four levels: Variety Match, Mismatch, Mismatch Social, and Dialect Literacy), Word Type (with two levels: Contrastive and Non-contrastive), and the interaction between them, and by group-level (random) effects of random intercepts and slopes of Word Type by participant, and random intercepts and slopes of Variety Exposure by item.

  • Testing Model: nLEDs are predicted by population-level (fixed) effects of Task (with two levels: Reading and Spelling), Variety Exposure condition, and Word Type, and the interaction between them, and by group-level (random) effects of random intercepts and slopes of Task and Word Type by participant, and random intercepts and slopes of Variety Exposure by item. Crucially, the group-level effects by participant did not include the interaction between Task and Word Type, in order to reduce model complexity.

  • Exploratory Covariate Testing Model: nLEDs are predicted by population-level (fixed) effects of mean nLED during the Vocabulary Test, Task, Variety Exposure condition, Word Type, and the interaction between them, and by group-level (random) effects of random intercepts and slopes of Task and Word Type by participant, and random intercepts and slopes of mean nLED during the Vocabulary Test and Variety Exposure by item. Again, the group-level effects by participant did not include the interaction between Task and Word Type, in order to reduce model complexity.

In all models, the approach was to use weakly informative, regularising priors for fitting. Where divergences were detected during fitting, these priors were adjusted, typically placing less prior weight on extreme values. Largely, the priors were selected to allow the posterior to be determined primarily by the data. Full details of the priors and posterior predictive checks are provided in Appendix A. Model summaries for the population-level (fixed) effects for all fitted models can be found in Appendix B. To answer questions pertaining to our pre-registered hypotheses, and to generate plots for these summaries, we used draws from the posterior for different combinations of conditions using the tidybayes [Version 2.0.1] R-package.

In all following plots and reported statistics, summaries are provided for the joint posterior of the model, taking into account all distributional parameters during sampling. This provides an overall nLED for any comparison, rather than separate estimates of nLEDs between the bounds of 0 and 1 and for the extremes of 0 and 1. For reported results in tables, estimates are based on the median and credible interval around the median. The median was selected to summarise these models over the mean as this method is more robust to distributions with more than one mode. Thus, we do not provide individual statistics and plots for the individual distributional terms (e.g. for zero-one inflation, or conditional-one inflation) as we did not specify any hypotheses related to these individual terms. Instead, the zero-one inflated Beta models are used purely to improve model fit and to make more accurate predictions about the overall differences in nLEDs across conditions. Ninety percent credible intervals are used to summarise uncertainty in the estimates as these intervals are more stable than wider intervals when given a limited number of draws from the posterior (Kruschke, 2014).

The differences in nLEDs between conditions were compared using the compare_levels() function from the tidybayes [Version 2.0.1] R-package. This allows for a direct comparison of differences between groups, which provides a more accurate and reliable method of establishing group differences than visual inspection of whether credible intervals overlap from estimates of the individual groups (Schenker & Gentleman, 2001). Here, the posterior is summarised as the median and 90% credible interval around the median.
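
The logic of comparing conditions draw by draw can be sketched as follows. This Python example is a simplified stand-in for what compare_levels() does, not its implementation: it pairs posterior draws from two conditions, takes their difference within each draw, and summarises the differences as a median with an equal-tailed 90% interval. The function names and the linear-interpolation quantile rule are assumptions made for illustration.

```python
import statistics


def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def summarise_difference(draws_a, draws_b, width=0.90):
    """Median and equal-tailed interval of per-draw differences (a - b).

    draws_a and draws_b must hold predictions from the same posterior
    draws, in the same order, so that each difference is within-draw.
    """
    diffs = sorted(a - b for a, b in zip(draws_a, draws_b))
    tail = (1 - width) / 2
    return (
        statistics.median(diffs),
        quantile(diffs, tail),
        quantile(diffs, 1 - tail),
    )
```

Because the difference is computed within each draw, the resulting interval reflects the posterior uncertainty of the contrast itself, which is why it can exclude zero even when the intervals for the two individual conditions overlap.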

To determine support for hypotheses using these estimates, the probability of direction \(P(direction)\), or pd, is provided as calculated using the bayestestR [Version 0.5.2] R-package. This is defined as the proportion of the posterior that is of the same sign as the median. In previous simulations, the pd has been found to be linearly related to the frequentist p-value (Makowski et al., 2019). The pd therefore provides an index of the existence of an effect, expressing certainty in whether an effect is positive or negative. This can be used to ultimately reject the null hypothesis, but like the frequentist p-value does not give a reliable estimate of evidence in support of the null hypothesis. Unlike the frequentist p-value, a “significant” effect here is typically associated with a larger proportion of the posterior being of the same sign as the median (e.g. a p-value of <.05 is akin to a pd of >.95).
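
The definition of the pd given above translates directly into a few lines of code. The Python sketch below is an illustration of that definition rather than the bayestestR implementation; the function name and the example draws in the usage note are hypothetical.

```python
def probability_of_direction(draws):
    """Proportion of posterior draws sharing the sign of the median.

    Taking the larger of the positive and negative proportions is
    equivalent whenever the median is non-zero.
    """
    n = len(draws)
    positive = sum(d > 0 for d in draws) / n
    negative = sum(d < 0 for d in draws) / n
    return max(positive, negative)
```

For the five draws [-0.01, 0.02, 0.04, 0.05, 0.05], four are positive, so the pd is 0.8. The index is bounded below by 0.5 (no evidence for either direction) whenever no draw is exactly zero.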

Additional hypothesis tests are provided in the form of Region of Practical Equivalence (ROPE) analyses from these draws, also using the bayestestR [Version 0.5.2] R-package. This defines an area around the point null that is practically equivalent to zero for assessing evidence in support of the null hypothesis (Kruschke, 2014). Here, the bounds of the ROPE are defined as half the smallest effect reported in the Williams et al. (2020) parameter estimates, and intervals are reported as the 90% highest density interval (HDI) of the posterior. We report the proportion of the HDI contained within the ROPE along with the bounds of this interval. Where HDIs are entirely contained by the equivalence bounds, equivalence is accepted. Where HDIs are entirely outside the equivalence bounds, equivalence is rejected. Uncertainty is assigned to any HDIs that cross the equivalence bounds in either (or both) directions. The HDI differs from the equal-tailed intervals used for summary statistics in that values within the range are always more probable than values outside of the range, and the interval need not exclude an equal amount of the distribution towards both tails. With symmetric distributions, the two methods produce similar results. For completeness, we report both 90% CIs and HDIs.
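
The decision rule described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the bayestestR code: the HDI is approximated as the shortest window of sorted draws covering the requested mass, the ROPE percentage is the share of draws inside that HDI that also fall inside the ROPE, and all function names and decision labels are hypothetical.

```python
import math


def hdi(draws, width=0.90):
    """Shortest interval containing `width` of the sorted draws."""
    ordered = sorted(draws)
    n = len(ordered)
    # Number of draws the interval must cover (guard against float error)
    window = max(1, math.ceil(width * n - 1e-9))
    start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]


def rope_percentage(draws, rope_low, rope_high, width=0.90):
    """Share of the draws inside the HDI that also lie inside the ROPE."""
    low, high = hdi(draws, width)
    inside_hdi = [d for d in draws if low <= d <= high]
    in_rope = sum(rope_low <= d <= rope_high for d in inside_hdi)
    return in_rope / len(inside_hdi)


def rope_decision(hdi_low, hdi_high, rope_low, rope_high):
    """Accept, reject, or remain undecided about practical equivalence."""
    if rope_low <= hdi_low and hdi_high <= rope_high:
        return "accept equivalence"
    if hdi_high < rope_low or hdi_low > rope_high:
        return "reject equivalence"
    return "undecided"
```

With a ROPE of [-0.02, 0.02], an HDI of [0.05, 0.14] lies entirely outside the region and equivalence is rejected, whereas an HDI of [-0.01, 0.09] crosses the upper bound and the comparison remains undecided.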

In plots, posterior medians and 80% and 90% credible intervals are provided for different conditions. Table summaries also provide posterior medians with 90% credible intervals. In the tables of population level (fixed) effects, \(\hat{R}\) is a measure of convergence for within- and between-chain estimates, with values closer to 1 being preferable. The bulk and tail effective sample sizes give diagnostics of the number of draws which contain the same amount of information as the dependent sample (Vehtari et al., 2019), with higher values being preferable. The tail effective sample size is determined at the 5% and 95% quantiles, while the bulk is determined at values in between these quantiles.

Vocabulary Test Model

Word Type by Variety Exposure

We tested for any differences in performance for different word types across conditions during the vocabulary testing phase. These results are summarised below as mean differences between word types, with error bars adjusted for within-subjects effects using the Morey (2008) correction, along with densities and points for mean scores for each participant.

The large variability in performance across participants, up to and including the bounds of the dependent variable, demonstrates the need for modelling such data using a zero-one inflated Beta model. Of most interest, however, this plot shows no substantial differences by word type across conditions. Posterior medians with 80% and 90% credible intervals are shown for each word type within each variety exposure condition below.

These plots show a similar trend in the estimate of effects as provided in the point estimates of the raw data, demonstrating that the choice of priors does not substantially skew results in any direction. Posterior median and credible interval summaries of the depicted effects are provided in the table below.

| Variety Exposure | Word Type | Median | Percentile Interval |
|---|---|---|---|
| No Dialect | Non-Contrastive | 0.675 | [0.611, 0.725] |
| No Dialect | Contrastive | 0.700 | [0.648, 0.741] |
| Dialect | Non-Contrastive | 0.623 | [0.555, 0.679] |
| Dialect | Contrastive | 0.652 | [0.593, 0.700] |
| Dialect & Social | Non-Contrastive | 0.645 | [0.587, 0.692] |
| Dialect & Social | Contrastive | 0.686 | [0.637, 0.726] |
| Dialect Literacy | Non-Contrastive | 0.635 | [0.561, 0.695] |
| Dialect Literacy | Contrastive | 0.689 | [0.634, 0.731] |

These results show that performance is generally poor in all variety exposure conditions in the vocabulary testing phase, with all median nLEDs at or above 0.623.

To explore whether there are any reliable differences in performance for each word type within the variety exposure conditions, posterior draws were compared across each level of word type within the variety exposure conditions. Posterior medians with 80% and 90% credible intervals are shown for the comparison between each word type within each variety exposure condition below.

Posterior medians and credible intervals for this comparison are provided in the table below.

| Variety Exposure | Word Type | Median | Percentile Interval | ROPE Percentage | HDI Interval | P(Direction) |
|---|---|---|---|---|---|---|
| No Dialect | Contrastive - Non-Contrastive | 0.025 | [-0.03, 0.08] | 0.387 | [-0.03, 0.08] | 0.774 |
| Dialect | Contrastive - Non-Contrastive | 0.029 | [-0.03, 0.09] | 0.351 | [-0.03, 0.09] | 0.794 |
| Dialect & Social | Contrastive - Non-Contrastive | 0.040 | [-0.01, 0.09] | 0.229 | [-0.01, 0.09] | 0.911 |
| Dialect Literacy | Contrastive - Non-Contrastive | 0.053 | [-0.01, 0.12] | 0.161 | [-0.01, 0.12] | 0.920 |

Note: ROPE range = [-0.02, 0.02]. ROPE determined at the 90% CI of the HDI.

In all instances there is some evidence that performance is better for non-contrastive words relative to contrastive words. In the Variety Mismatch Social and Dialect Literacy conditions, the difference between these two scores has an approximately 91% and 92% probability of being positive, respectively. However, the remaining two comparisons have a less than 80% probability of being positive. Given that in all cases the 90% credible interval spans 0, there is insufficient evidence to rule out an effect in the opposite direction. Together, these findings suggest only weak evidence for any difference between our main measures of performance during the vocabulary testing phase prior to further training and testing.

Testing Phase Model

As previous research has shown that effects reported in the training phase are secondary to the testing phase, in the interest of brevity we only report findings from the testing phase. Indeed, given the large cost in time for transcribing individual trials across all participants, only the reading data for this task have been transcribed. However, the spelling data during the training phase, and all other results, are freely available at https://osf.io/7ct9x/.

As with the Vocabulary Test Model, to answer questions pertaining to our pre-registered hypotheses, and to generate plots for these summaries, we used draws from the posterior for different combinations of conditions. Similarly, hypothesis tests are provided in the form of ROPE and pd.

Word Type by Task and Variety Exposure

We tested whether there are any differences in performance for different word types across tasks for the variety exposure conditions during the testing phase. These results are summarised using the same method employed in the exposure phase model.

Posterior medians with 80% and 90% credible intervals are shown for each word type within each task and variety exposure condition below.

Posterior medians and credible intervals are provided in the table below.

| Task | Variety Exposure | Word Type | Median | Percentile Interval |
|---|---|---|---|---|
| Reading | No Dialect | Non-Contrastive | 0.178 | [0.138, 0.225] |
| Reading | No Dialect | Contrastive | 0.189 | [0.147, 0.236] |
| Reading | Dialect | Non-Contrastive | 0.173 | [0.132, 0.221] |
| Reading | Dialect | Contrastive | 0.215 | [0.170, 0.262] |
| Reading | Dialect & Social | Non-Contrastive | 0.193 | [0.150, 0.240] |
| Reading | Dialect & Social | Contrastive | 0.233 | [0.185, 0.283] |
| Reading | Dialect Literacy | Non-Contrastive | 0.166 | [0.126, 0.210] |
| Reading | Dialect Literacy | Contrastive | 0.262 | [0.215, 0.312] |
| Spelling | No Dialect | Non-Contrastive | 0.266 | [0.220, 0.314] |
| Spelling | No Dialect | Contrastive | 0.282 | [0.234, 0.338] |
| Spelling | Dialect | Non-Contrastive | 0.273 | [0.227, 0.325] |
| Spelling | Dialect | Contrastive | 0.274 | [0.226, 0.330] |
| Spelling | Dialect & Social | Non-Contrastive | 0.296 | [0.247, 0.348] |
| Spelling | Dialect & Social | Contrastive | 0.295 | [0.242, 0.358] |
| Spelling | Dialect Literacy | Non-Contrastive | 0.265 | [0.217, 0.319] |
| Spelling | Dialect Literacy | Contrastive | 0.325 | [0.269, 0.398] |

Overall performance is better in the testing phase than the vocabulary testing phase, with the highest median nLED being 0.325.

We used the same method as in the vocabulary testing phase to directly compare performance for contrastive words relative to non-contrastive words within each task and variety exposure condition. Posterior medians with 80% and 90% credible intervals are shown for the comparison between each word type within each task and variety exposure condition below.

Posterior medians and credible intervals for this comparison are provided in the table below.

| Task | Variety Exposure | Word Type | Median | Percentile Interval | ROPE Percentage | HDI Interval | P(Direction) |
|---|---|---|---|---|---|---|---|
| Reading | No Dialect | Contrastive - Non-Contrastive | 0.010 | [-0.04, 0.06] | 0.555 | [-0.03, 0.06] | 0.648 |
| Reading | Dialect | Contrastive - Non-Contrastive | 0.041 | [-0.00, 0.09] | 0.194 | [-0.00, 0.09] | 0.931 |
| Reading | Dialect & Social | Contrastive - Non-Contrastive | 0.039 | [-0.01, 0.09] | 0.226 | [-0.01, 0.09] | 0.914 |
| Reading | Dialect Literacy | Contrastive - Non-Contrastive | 0.095 | [0.05, 0.14] | 0.000 | [0.05, 0.14] | 0.999 |
| Spelling | No Dialect | Contrastive - Non-Contrastive | 0.016 | [-0.03, 0.06] | 0.499 | [-0.03, 0.06] | 0.714 |
| Spelling | Dialect | Contrastive - Non-Contrastive | 0.001 | [-0.05, 0.05] | 0.570 | [-0.04, 0.05] | 0.512 |
| Spelling | Dialect & Social | Contrastive - Non-Contrastive | 0.000 | [-0.05, 0.05] | 0.533 | [-0.05, 0.05] | 0.502 |
| Spelling | Dialect Literacy | Contrastive - Non-Contrastive | 0.060 | [0.01, 0.12] | 0.070 | [0.00, 0.12] | 0.972 |

Note: ROPE range = [-0.02, 0.02]. ROPE determined at the 90% CI of the HDI.

While nLEDs are generally higher for contrastive words for all variety exposure conditions and for both tasks, this difference is noticeably smaller in the Variety Match condition and larger in the Dialect Literacy condition. A direct comparison between each level of word type by variety exposure condition shows that the 90% credible intervals around difference scores for nLEDs contain zero in all contrasts except for the Dialect Literacy condition, in which performance is worse for contrastive words relative to non-contrastive words across both reading and spelling tasks.

Observing the pattern of results in the figure above, there is evidence that the 80% credible interval for the effect of word type does not cross zero for the Dialect and Dialect & Social conditions in the reading task. This indicates a weaker effect of word type than in the Dialect Literacy condition but in the same direction, such that performance is generally worse for contrastive words relative to non-contrastive words. Indeed, for the reading task in the Dialect and Dialect & Social conditions, pds are 0.931 and 0.914, indicating that over 90% of the posterior is of the median’s sign. By comparison this effect is noticeably stronger in the Dialect Literacy condition in which the pd is 0.999. For the spelling task, there is evidence of a word type effect in the Dialect Literacy condition only, in which the pd is 0.972. All other contrasts have a less than 72% probability of being the same sign as the median.

Together, this suggests that there is evidence of an effect of word type by which performance is worse for contrastive words relative to non-contrastive words for the reading task in all dialect conditions, and for the spelling task in the Dialect Literacy condition only. However, evidence for an effect of word type in the Dialect and Dialect & Social conditions is noticeably weaker than that of the Dialect Literacy condition in the reading task.

Novel Words by Task and Variety Exposure

We tested whether there are any differences in performance for novel words across tasks for the variety exposure conditions during the testing phase. These results are summarised using the same method employed in previous analyses.

Posterior medians with 80% and 90% credible intervals are shown for each word type within each task and variety exposure condition below.

Posterior medians and credible intervals are provided in the table below.

| Task | Variety Exposure | Median | Percentile Interval |
|---|---|---|---|
| Reading | No Dialect | 0.225 | [0.167, 0.304] |
| Reading | Dialect | 0.220 | [0.168, 0.277] |
| Reading | Dialect & Social | 0.212 | [0.156, 0.283] |
| Reading | Dialect Literacy | 0.244 | [0.188, 0.308] |
| Spelling | No Dialect | 0.274 | [0.223, 0.325] |
| Spelling | Dialect | 0.290 | [0.220, 0.405] |
| Spelling | Dialect & Social | 0.279 | [0.217, 0.368] |
| Spelling | Dialect Literacy | 0.292 | [0.228, 0.392] |

Overall performance is better in the reading task than the spelling task, with maximum median nLEDs of 0.244 and 0.292 respectively, both of which are found in the Dialect Literacy condition. We used the same method as in previous analyses to directly compare performance for novel words across each Variety Exposure condition and within each Task. Posterior medians with 80% and 90% credible intervals are shown for the comparison for Novel Words between each Variety Exposure condition within each Task below.

Posterior medians and credible intervals for this comparison are provided in the table below.

| Task | Variety Exposure | Median | Percentile Interval | ROPE Percentage | HDI Interval | P(Direction) |
|---|---|---|---|---|---|---|
| Reading | No Dialect - Dialect | 0.005 | [-0.06, 0.09] | 0.672 | [-0.06, 0.08] | 0.553 |
| Reading | No Dialect - Dialect & Social | 0.013 | [-0.06, 0.09] | 0.623 | [-0.07, 0.09] | 0.619 |
| Reading | No Dialect - Dialect Literacy | -0.018 | [-0.09, 0.06] | 0.581 | [-0.10, 0.06] | 0.662 |
| Reading | Dialect - Dialect & Social | 0.008 | [-0.06, 0.07] | 0.687 | [-0.06, 0.07] | 0.580 |
| Reading | Dialect - Dialect Literacy | -0.024 | [-0.09, 0.04] | 0.616 | [-0.09, 0.04] | 0.731 |
| Reading | Dialect & Social - Dialect Literacy | -0.032 | [-0.10, 0.04] | 0.521 | [-0.10, 0.04] | 0.772 |
| Spelling | No Dialect - Dialect | -0.017 | [-0.13, 0.06] | 0.555 | [-0.11, 0.07] | 0.636 |
| Spelling | No Dialect - Dialect & Social | -0.005 | [-0.09, 0.06] | 0.656 | [-0.08, 0.07] | 0.551 |
| Spelling | No Dialect - Dialect Literacy | -0.018 | [-0.12, 0.05] | 0.586 | [-0.10, 0.06] | 0.658 |
| Spelling | Dialect - Dialect & Social | 0.012 | [-0.08, 0.12] | 0.511 | [-0.09, 0.12] | 0.583 |
| Spelling | Dialect - Dialect Literacy | -0.001 | [-0.11, 0.12] | 0.482 | [-0.11, 0.11] | 0.506 |
| Spelling | Dialect & Social - Dialect Literacy | -0.013 | [-0.12, 0.08] | 0.536 | [-0.11, 0.08] | 0.600 |

Note: ROPE range = [-0.035, 0.035]. ROPE determined at the 90% CI of the HDI.

There are no reliable differences in nLEDs for Novel Words across Variety Exposure conditions within each task. Here, all 90% credible intervals span both sides of zero and all pds are less than or equal to 0.772. This suggests that there are no reliable differences across conditions in how novel words are decoded across each task. This indicates that exposure to a dialect (in any of these forms) does not have a negative impact on novel word decoding, even when compared to the No Dialect condition.

Exploratory Covariate Testing Model

We performed a series of exploratory analyses testing whether or not the effects described above may be modulated by how well learners entrenched the language prior to learning how to read and spell using the language. It is expected that if learners have internalised the language more during the exposure phase then the variety they were exposed to should be more readily available during the training and testing phases. In the dialect conditions this should cause greater competition between the dialect and standard forms of the language during both phases. However, if the language is not well entrenched during the exposure phase, it is likely that the dialect form of the language is less accessible during training and testing, resulting in little to no competition between the dialect and standard forms of the language.

Using the vocabulary testing performance as a proxy for entrenchment of the language variety during the exposure phase, we would predict that as mean performance in the vocabulary test improves, overall performance in the testing phase improves, but crucially that as mean performance in the vocabulary test improves, performance will be worse for contrastive words relative to non-contrastive words in the dialect exposure conditions. However, in the No Dialect condition we would predict no such word type effects. Similarly, if, contrary to previous findings (e.g. Williams et al., 2020), exposure to a dialect can indeed affect novel word decoding, we would predict that decoding for novel words would only be impaired in the dialect conditions if the dialect form of the language was sufficiently entrenched (i.e. when vocabulary test performance is relatively good).

As with previous models, draws from the posterior for different combinations of conditions were taken using the tidybayes R-package. Similarly, hypothesis tests are provided in the form of Region of Practical Equivalence (ROPE) analyses from these draws using the bayestestR R-package. Caution is needed for interpreting such hypothesis tests as the following models are exploratory.

Word Type by Task, Variety Exposure, and Continuous Effects of Vocabulary Test Performance

We first explored whether mean vocabulary test performance (i.e. in terms of mean nLED) predicts testing performance, and whether or not this varies across Task, Variety Exposure condition, and Word Type. A plot of this relationship is shown below.

Observing the figure above, there is a clear effect in the Dialect and Dialect & Social conditions by which participants with high nLEDs (indicating poorer performance) on the vocabulary testing phase perform equally poorly when reading non-contrastive and contrastive words. However, participants with low nLEDs (indicating better performance) on the vocabulary testing phase perform better with non-contrastive words relative to contrastive words in the dialect conditions. In the Dialect Literacy condition, while performance is consistently poorer on contrastive words relative to non-contrastive words, the effect is more pronounced for those who performed better in the vocabulary testing phase than for those who performed worse, across both tasks. Crucially, in the No Dialect condition even those with better performance in the vocabulary testing phase perform equally well with non-contrastive and contrastive words.

Comparing the effects across conditions, it is clear that performance is generally worse in the dialect conditions for contrastive words relative to non-contrastive words. For these non-contrastive words performance is comparable to that in the No Dialect condition. This indicates a localised cost to performance for contrastive words, rather than a boost to performance for non-contrastive words, in the dialect conditions. Similar effects are shown in the Dialect Literacy condition in the spelling task, while performance is equivalent for each word type for the remaining three conditions.

To better demonstrate this effect, we performed a median split based on vocabulary test performance, categorising participants as having high or low nLEDs in the vocabulary test relative to the median score.
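
A median split of this kind can be sketched in Python as follows. The group labels mirror those used in the results table, but the function name, the participant identifiers, and the handling of scores exactly equal to the median (assigned here to the better-performing group) are assumptions, as the tie-breaking rule is not stated.

```python
import statistics


def median_split(mean_nleds):
    """Assign each participant to a vocabulary-test performance group.

    mean_nleds: dict mapping participant id to mean vocabulary-test nLED.
    Lower nLEDs indicate better performance.
    """
    cut = statistics.median(mean_nleds.values())
    return {
        pid: "Better Performance" if score <= cut else "Worse Performance"
        for pid, score in mean_nleds.items()
    }
```

For instance, four participants with mean nLEDs of 0.2, 0.6, 0.8, and 0.9 have a median of 0.7, so the first two fall into the better-performing group and the last two into the worse-performing group.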

Word Type by Task, Variety Exposure, and a Median Split of Vocabulary Test Performance

We tested whether there are any differences in performance for participants who did poorly or well in the vocabulary testing phase relative to the median for contrastive and non-contrastive words split by task and variety exposure in the testing phase.

Posterior medians with 80% and 90% credible intervals are shown for those who did poorly and well in the vocabulary testing phase for each word type within each task and variety exposure condition in the testing phase below.

Posterior medians and credible intervals are provided in the table below.

| Exposure Test nLED Group | Task | Variety Exposure | Word Type | Median | Percentile Interval |
|---|---|---|---|---|---|
| Better Performance | Reading | No Dialect | Non-Contrastive | 0.096 | [0.036, 0.179] |
| Better Performance | Reading | No Dialect | Contrastive | 0.102 | [0.040, 0.187] |
| Better Performance | Reading | Dialect | Non-Contrastive | 0.098 | [0.035, 0.191] |
| Better Performance | Reading | Dialect | Contrastive | 0.161 | [0.086, 0.236] |
| Better Performance | Reading | Dialect & Social | Non-Contrastive | 0.105 | [0.037, 0.201] |
| Better Performance | Reading | Dialect & Social | Contrastive | 0.161 | [0.082, 0.243] |
| Better Performance | Reading | Dialect Literacy | Non-Contrastive | 0.103 | [0.043, 0.176] |
| Better Performance | Reading | Dialect Literacy | Contrastive | 0.220 | [0.142, 0.283] |
| Better Performance | Spelling | No Dialect | Non-Contrastive | 0.209 | [0.132, 0.278] |
| Better Performance | Spelling | No Dialect | Contrastive | 0.242 | [0.168, 0.311] |
| Better Performance | Spelling | Dialect | Non-Contrastive | 0.221 | [0.138, 0.299] |
| Better Performance | Spelling | Dialect | Contrastive | 0.212 | [0.125, 0.301] |
| Better Performance | Spelling | Dialect & Social | Non-Contrastive | 0.232 | [0.135, 0.313] |
| Better Performance | Spelling | Dialect & Social | Contrastive | 0.239 | [0.151, 0.318] |
| Better Performance | Spelling | Dialect Literacy | Non-Contrastive | 0.234 | [0.148, 0.315] |
| Better Performance | Spelling | Dialect Literacy | Contrastive | 0.296 | [0.212, 0.376] |
| Worse Performance | Reading | No Dialect | Non-Contrastive | 0.232 | [0.166, 0.315] |
| Worse Performance | Reading | No Dialect | Contrastive | 0.244 | [0.175, 0.335] |
| Worse Performance | Reading | Dialect | Non-Contrastive | 0.255 | [0.179, 0.352] |
| Worse Performance | Reading | Dialect | Contrastive | 0.270 | [0.205, 0.355] |
| Worse Performance | Reading | Dialect & Social | Non-Contrastive | 0.264 | [0.190, 0.356] |
| Worse Performance | Reading | Dialect & Social | Contrastive | 0.291 | [0.214, 0.402] |
| Worse Performance | Reading | Dialect Literacy | Non-Contrastive | 0.218 | [0.154, 0.307] |
| Worse Performance | Reading | Dialect Literacy | Contrastive | 0.296 | [0.229, 0.394] |
| Worse Performance | Spelling | No Dialect | Non-Contrastive | 0.293 | [0.239, 0.361] |
| Worse Performance | Spelling | No Dialect | Contrastive | 0.298 | [0.244, 0.371] |
| Worse Performance | Spelling | Dialect | Non-Contrastive | 0.335 | [0.264, 0.450] |
| Worse Performance | Spelling | Dialect | Contrastive | 0.341 | [0.267, 0.456] |
| Worse Performance | Spelling | Dialect & Social | Non-Contrastive | 0.339 | [0.277, 0.420] |
| Worse Performance | Spelling | Dialect & Social | Contrastive | 0.338 | [0.265, 0.457] |
| Worse Performance | Spelling | Dialect Literacy | Non-Contrastive | 0.290 | [0.231, 0.362] |
| Worse Performance | Spelling | Dialect Literacy | Contrastive | 0.349 | [0.275, 0.470] |

Overall, for each word type, performance in the testing phase is generally better for participants who did well in the vocabulary test than for those who did poorly.

To explore whether vocabulary testing performance has any effect on performance for non-contrastive and contrastive words within each task and variety exposure condition, we used a method similar to that of previous analyses to compare draws from the posterior.

This plot shows a similar effect to that described for the continuous plot above: reading performance is worse for contrastive words relative to non-contrastive words in the Dialect and Dialect & Social conditions for participants who performed well in the vocabulary testing phase. For the Dialect Literacy condition, there is clear evidence that performance is worse for contrastive relative to non-contrastive words in both the reading and spelling tasks. However, this effect is both (a) larger in the reading task relative to the spelling task, and (b) within the reading task, larger for those who performed well in the vocabulary testing phase relative to those who performed poorly.

A direct comparison of the differences in performance for contrastive words relative to non-contrastive words, split by task, vocabulary testing performance, and variety exposure, is provided in the table below.

| Task | Exposure Test nLED Group | Variety Exposure | Word Type | Median | Percentile Interval | ROPE Percentage | HDI Interval | P(Direction) |
|---|---|---|---|---|---|---|---|---|
| Reading | Better Performance | No Dialect | Contrastive - Non-Contrastive | 0.006 | [-0.03, 0.05] | 0.627 | [-0.03, 0.04] | 0.597 |
| Reading | Better Performance | Dialect | Contrastive - Non-Contrastive | 0.060 | [0.02, 0.10] | 0.009 | [0.02, 0.10] | 0.992 |
| Reading | Better Performance | Dialect & Social | Contrastive - Non-Contrastive | 0.053 | [0.01, 0.10] | 0.079 | [0.00, 0.10] | 0.974 |
| Reading | Better Performance | Dialect Literacy | Contrastive - Non-Contrastive | 0.114 | [0.07, 0.16] | 0.000 | [0.07, 0.16] | 1.000 |
| Reading | Worse Performance | No Dialect | Contrastive - Non-Contrastive | 0.012 | [-0.04, 0.06] | 0.507 | [-0.04, 0.06] | 0.660 |
| Reading | Worse Performance | Dialect | Contrastive - Non-Contrastive | 0.015 | [-0.04, 0.07] | 0.475 | [-0.04, 0.06] | 0.664 |
| Reading | Worse Performance | Dialect & Social | Contrastive - Non-Contrastive | 0.027 | [-0.03, 0.08] | 0.373 | [-0.03, 0.09] | 0.773 |
| Reading | Worse Performance | Dialect Literacy | Contrastive - Non-Contrastive | 0.079 | [0.02, 0.14] | 0.000 | [0.02, 0.14] | 0.990 |
| Spelling | Better Performance | No Dialect | Contrastive - Non-Contrastive | 0.032 | [-0.02, 0.09] | 0.317 | [-0.02, 0.09] | 0.847 |
| Spelling | Better Performance | Dialect | Contrastive - Non-Contrastive | -0.008 | [-0.06, 0.04] | 0.535 | [-0.06, 0.04] | 0.595 |
| Spelling | Better Performance | Dialect & Social | Contrastive - Non-Contrastive | 0.009 | [-0.05, 0.06] | 0.431 | [-0.05, 0.07] | 0.593 |
| Spelling | Better Performance | Dialect Literacy | Contrastive - Non-Contrastive | 0.063 | [-0.00, 0.12] | 0.104 | [-0.00, 0.12] | 0.936 |
| Spelling | Worse Performance | No Dialect | Contrastive - Non-Contrastive | 0.006 | [-0.05, 0.06] | 0.519 | [-0.04, 0.06] | 0.574 |
| Spelling | Worse Performance | Dialect | Contrastive - Non-Contrastive | 0.006 | [-0.05, 0.07] | 0.436 | [-0.05, 0.07] | 0.571 |
| Spelling | Worse Performance | Dialect & Social | Contrastive - Non-Contrastive | 0.000 | [-0.06, 0.07] | 0.450 | [-0.06, 0.06] | 0.504 |
| Spelling | Worse Performance | Dialect Literacy | Contrastive - Non-Contrastive | 0.059 | [0.00, 0.14] | 0.131 | [-0.00, 0.13] | 0.955 |
Note:
ROPE range = [-0.02, 0.02]. ROPE determined at the 90% CI of the HDI.
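
The columns of the comparison table can be illustrated with a short sketch of how a difference between paired posterior draws might be summarised. This is not the R code used for the analysis. It assumes that the ROPE percentage is the share of draws inside the 90% HDI that fall within [-0.02, 0.02] (one reading of the table note), and that P(Direction) is the larger of the proportions of draws above and below zero. All function names are hypothetical.

```python
import math
from statistics import median


def quantile(sorted_x, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (pos - lo)


def hdi(sorted_x, mass=0.90):
    """Narrowest interval that contains `mass` of the sorted draws."""
    n = len(sorted_x)
    k = math.ceil(mass * n - 1e-9)  # draws inside the interval; tolerance guards float error
    start = min(range(n - k + 1), key=lambda s: sorted_x[s + k - 1] - sorted_x[s])
    return sorted_x[start], sorted_x[start + k - 1]


def summarise_difference(contrastive, non_contrastive, rope=(-0.02, 0.02), mass=0.90):
    """Summarise contrastive minus non-contrastive nLED, paired draw by draw."""
    diffs = sorted(c - n for c, n in zip(contrastive, non_contrastive))
    tail = (1 - mass) / 2
    lo, hi = hdi(diffs, mass)
    inside_hdi = [d for d in diffs if lo <= d <= hi]
    # Assumption: ROPE percentage is computed over the draws inside the HDI.
    in_rope = sum(rope[0] <= d <= rope[1] for d in inside_hdi) / len(inside_hdi)
    p_positive = sum(d > 0 for d in diffs) / len(diffs)
    p_negative = sum(d < 0 for d in diffs) / len(diffs)
    return {
        "median": median(diffs),
        "percentile_interval": (quantile(diffs, tail), quantile(diffs, 1 - tail)),
        "hdi": (lo, hi),
        "rope_percentage": in_rope,
        "p_direction": max(p_positive, p_negative),
    }
```

Under these assumptions, a row with a ROPE percentage near zero and a P(Direction) near one (e.g. reading in the Dialect Literacy condition) indicates a difference that is both reliably positive and larger than the region treated as practically equivalent to zero.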

The comparison of performance across both word types in the posterior summary corroborates the conclusions drawn from the plots above. The strong effects shown in both tasks in the Dialect Literacy condition, for those with poorer and those with better performance in the vocabulary testing phase alike, likely reflect the fact that this condition interleaves the dialect form of the language with the standard form during training (rather than front-loading it prior to the vocabulary test). This allows for sufficient entrenchment of the dialect form of the language, causing a great deal of local interference in both tasks.

Novel Words by Task, Variety Exposure, and Continuous Effects of Vocabulary Test Performance

We next focussed on exploring whether decoding of novel words is affected by task, variety exposure condition, and performance in the vocabulary testing phase. The following analyses summarise these effects in the covariate testing model.

Similarly to the analysis by word type, we again performed a median split based on vocabulary testing performance to better highlight any effects of vocabulary testing performance on novel word decoding within each task and variety exposure condition.

Novel Words by Task, Variety Exposure, and a Median Split of Vocabulary Test Performance

Posterior medians with 80% and 90% credible intervals are shown for those who did poorly and well in the vocabulary testing phase for novel words within each task and variety exposure condition in the testing phase below.

Posterior medians and credible intervals are provided in the table below.

| Task | Exposure Test nLED Group | Variety Exposure | Median | Percentile Interval |
|---|---|---|---|---|
| Reading | Better Performance | No Dialect | 0.176 | [0.091, 0.277] |
| Reading | Better Performance | Dialect | 0.179 | [0.095, 0.255] |
| Reading | Better Performance | Dialect & Social | 0.181 | [0.102, 0.263] |
| Reading | Better Performance | Dialect Literacy | 0.207 | [0.112, 0.294] |
| Reading | Worse Performance | No Dialect | 0.251 | [0.181, 0.351] |
| Reading | Worse Performance | Dialect | 0.266 | [0.196, 0.345] |
| Reading | Worse Performance | Dialect & Social | 0.233 | [0.163, 0.330] |
| Reading | Worse Performance | Dialect Literacy | 0.273 | [0.197, 0.375] |
| Spelling | Better Performance | No Dialect | 0.225 | [0.135, 0.294] |
| Spelling | Better Performance | Dialect | 0.249 | [0.152, 0.376] |
| Spelling | Better Performance | Dialect & Social | 0.199 | [0.085, 0.311] |
| Spelling | Better Performance | Dialect Literacy | 0.262 | [0.169, 0.361] |
| Spelling | Worse Performance | No Dialect | 0.297 | [0.237, 0.367] |
| Spelling | Worse Performance | Dialect | 0.343 | [0.249, 0.518] |
| Spelling | Worse Performance | Dialect & Social | 0.335 | [0.250, 0.475] |
| Spelling | Worse Performance | Dialect Literacy | 0.316 | [0.227, 0.497] |

While performance is generally worse in the testing phase for those with poorer vocabulary test performance when compared to those with better vocabulary test performance, we used the same methods as in previous analyses to establish whether there are any differences in novel word decoding across variety exposure conditions depending upon the task and vocabulary testing performance.

Posterior medians with 80% and 90% credible intervals are shown below comparing performance for novel words across variety exposure conditions within each task and within those with poorer and better performance in the vocabulary testing phase.

Posterior medians and credible intervals for this comparison are provided in the table below.

| Task | Exposure Test nLED Group | Variety Exposure | Median | Percentile Interval | ROPE Percentage | HDI Interval | P(Direction) |
|---|---|---|---|---|---|---|---|
| Reading | Better Performance | Dialect - No Dialect | 0.002 | [-0.08, 0.07] | 0.371 | [-0.08, 0.08] | 0.518 |
| Reading | Better Performance | Dialect & Social - No Dialect | 0.006 | [-0.08, 0.09] | 0.327 | [-0.09, 0.09] | 0.544 |
| Reading | Better Performance | Dialect Literacy - No Dialect | 0.030 | [-0.07, 0.12] | 0.275 | [-0.06, 0.13] | 0.714 |
| Reading | Better Performance | Dialect & Social - Dialect | 0.005 | [-0.07, 0.08] | 0.382 | [-0.07, 0.08] | 0.541 |
| Reading | Better Performance | Dialect Literacy - Dialect | 0.026 | [-0.05, 0.11] | 0.312 | [-0.05, 0.11] | 0.708 |
| Reading | Better Performance | Dialect Literacy - Dialect & Social | 0.024 | [-0.07, 0.11] | 0.283 | [-0.06, 0.12] | 0.675 |
| Reading | Worse Performance | Dialect - No Dialect | 0.014 | [-0.08, 0.09] | 0.326 | [-0.08, 0.09] | 0.608 |
| Reading | Worse Performance | Dialect & Social - No Dialect | -0.018 | [-0.11, 0.08] | 0.319 | [-0.11, 0.08] | 0.666 |
| Reading | Worse Performance | Dialect Literacy - No Dialect | 0.023 | [-0.08, 0.12] | 0.276 | [-0.08, 0.12] | 0.647 |
| Reading | Worse Performance | Dialect & Social - Dialect | -0.034 | [-0.11, 0.06] | 0.246 | [-0.12, 0.05] | 0.727 |
| Reading | Worse Performance | Dialect Literacy - Dialect | 0.009 | [-0.08, 0.11] | 0.281 | [-0.09, 0.10] | 0.567 |
| Reading | Worse Performance | Dialect Literacy - Dialect & Social | 0.039 | [-0.06, 0.14] | 0.231 | [-0.06, 0.14] | 0.741 |
| Spelling | Better Performance | Dialect - No Dialect | 0.025 | [-0.05, 0.13] | 0.296 | [-0.06, 0.12] | 0.693 |
| Spelling | Better Performance | Dialect & Social - No Dialect | -0.027 | [-0.10, 0.06] | 0.283 | [-0.11, 0.04] | 0.726 |
| Spelling | Better Performance | Dialect Literacy - No Dialect | 0.039 | [-0.04, 0.13] | 0.246 | [-0.05, 0.12] | 0.786 |
| Spelling | Better Performance | Dialect & Social - Dialect | -0.053 | [-0.16, 0.04] | 0.194 | [-0.16, 0.05] | 0.832 |
| Spelling | Better Performance | Dialect Literacy - Dialect | 0.013 | [-0.11, 0.12] | 0.257 | [-0.09, 0.13] | 0.582 |
| Spelling | Better Performance | Dialect Literacy - Dialect & Social | 0.069 | [-0.04, 0.17] | 0.135 | [-0.03, 0.18] | 0.861 |
| Spelling | Worse Performance | Dialect - No Dialect | 0.045 | [-0.05, 0.22] | 0.228 | [-0.07, 0.18] | 0.757 |
| Spelling | Worse Performance | Dialect & Social - No Dialect | 0.041 | [-0.04, 0.16] | 0.218 | [-0.06, 0.14] | 0.771 |
| Spelling | Worse Performance | Dialect Literacy - No Dialect | 0.020 | [-0.08, 0.18] | 0.276 | [-0.10, 0.15] | 0.630 |
| Spelling | Worse Performance | Dialect & Social - Dialect | -0.006 | [-0.17, 0.13] | 0.228 | [-0.17, 0.13] | 0.530 |
| Spelling | Worse Performance | Dialect Literacy - Dialect | -0.024 | [-0.19, 0.14] | 0.188 | [-0.20, 0.13] | 0.601 |
| Spelling | Worse Performance | Dialect Literacy - Dialect & Social | -0.018 | [-0.16, 0.15] | 0.188 | [-0.18, 0.13] | 0.581 |
Note:
ROPE range = [-0.02, 0.02]. ROPE determined at the 90% CI of the HDI.
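
The six pairwise contrasts in each block of the table above follow from enumerating every pair of the four variety exposure conditions and subtracting the earlier condition's draws from the later one's. A minimal sketch of that enumeration, assuming that draws are paired by posterior draw index and using the hypothetical name `pairwise_differences`:

```python
from itertools import combinations

# Condition order as it appears in the tables.
CONDITIONS = ["No Dialect", "Dialect", "Dialect & Social", "Dialect Literacy"]


def pairwise_differences(draws):
    """Return difference draws for every pair of variety exposure conditions.

    draws maps a condition name to its list of posterior draws for one task
    and one vocabulary-performance group. Each contrast is labelled
    "later - earlier" to match the table's row labels.
    Assumption: element i of every list comes from the same posterior draw.
    """
    contrasts = {}
    for earlier, later in combinations(CONDITIONS, 2):
        label = f"{later} - {earlier}"
        contrasts[label] = [b - a for a, b in zip(draws[earlier], draws[later])]
    return contrasts
```

Each resulting list of difference draws can then be summarised with a median, interval, ROPE percentage, and probability of direction in the same way as any other posterior contrast.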

From the plot and table, it is clear that decoding of novel words is equivalent across variety exposure conditions within each task, both for those with poorer and for those with better vocabulary testing performance. This suggests that, regardless of how well entrenched the language may be, there are no substantial differences in novel word decoding across variety exposure conditions within each task.

Summary of Results

Together, these findings suggest that while there were no substantial differences in performance by word type within each variety condition during the vocabulary testing phase, after reading and spelling training reading performance was worse for contrastive words relative to non-contrastive words in the dialect conditions. This effect was also manifested in the spelling task when participants learned to read and spell in both the standard and dialect varieties of the language. This suggests that exposure to a dialect – and particularly learning to read and spell in a dialect in addition to the standard variety of a language – confers a local cost to processing contrastive words relative to non-contrastive words. However, the crucial question is whether exposure to a dialect also impedes performance in reading and spelling novel, untrained words (analogous to non-word reading tests in natural languages). Supporting findings by Williams et al. (2020), we found no evidence that exposure to a dialect influences learning to read and spell novel words.

Across all analyses we found consistent evidence that performance was generally worse in the spelling task relative to the reading task, presumably because reading supports a large range of strategies for word recognition (e.g. complete grapheme-phoneme conversion, partial decoding, or direct access to the depicted meaning) and thus production, while spelling supports only phoneme-grapheme conversion.

Exploratory analyses revealed that the local deficit to processing contrastive words in the dialect conditions likely arises only when the dialect form of the language is sufficiently entrenched. Indeed, in the Dialect and Dialect & Social conditions reading was only worse for contrastive words relative to non-contrastive words where performance was relatively good in the vocabulary test (i.e. with nLEDs below the sample median), indicating that the dialect form of the language was learned well following exposure to the language. However, in the Dialect Literacy condition both reading and spelling were worse for contrastive words relative to non-contrastive words. Yet, this effect was stronger in the reading task for those with better performance in the vocabulary test relative to those with worse performance. This is likely due to the Dialect Literacy condition providing further opportunities to entrench the dialect form of the language, such that better entrenchment of the language during exposure and Dialect Literacy training have additive effects in entrenching the dialect form, which is reflected in a larger localised cost to processing contrastive words relative to non-contrastive words. One implication of this finding is that those who have more strongly entrenched a dialect in the home environment may be more at risk of any deleterious effects of dialect exposure on reading words with a dialect variant when the standard pronunciation is expected or required. However, again there was no evidence that exposure to a dialect confers any disadvantage to reading or spelling novel words, regardless of how well, and in which form, the language was entrenched at the vocabulary test.